feat(accuracy): AIME 2025 lighteval-backed benchmark (AIP-876) #926
Try out this PR
Quick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3e85d67ae69a5294c4db72891ab72f09f9f6a2fe
Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@3e85d67ae69a5294c4db72891ab72f09f9f6a2fe
Last updated for commit:
Stack dependency
This PR is part of an 8-PR stack aligning aiperf's accuracy benchmarks with
Merge order:
This PR: position 5 of 8; base branch is
After each upstream PR merges, the downstream PR's branch will be rebased
Codecov Report
✅ All modified and coverable lines are covered by tests.
Implement ``AIME25Benchmark`` mirroring the trt-llm benchmark recipe's ``acc_bench_lighteval.py:aime25`` configuration: same ``aime_prompt_fn`` zero-shot rendering, ``generation_size=32768``, ``hf_repo="yentinglin/aime_2025"``. It has the same shape as ``AIME24Benchmark``, just pointed at the 2025 mirror.

The loader emits one ``BenchmarkProblem`` per dataset row with the bare problem text as ``prompt``, ``str(answer)`` as ``ground_truth``, and ``metadata.generation_size`` = 32768. ``tasks`` / ``n_shots`` / ``enable_cot`` are accepted for protocol uniformity but ignored. Pair with ``LightevalExprGrader`` for the recipe's ``expr_gold_metric`` extraction.

Built on top of AIP-875 (lighteval sub-stack ordering: 875 → 876). No heavy optional dependency is needed (``datasets`` is core), so CI gets 100% line + branch coverage out of the box.

Updates the stub registry:
- drop ``aime25`` from ``test_accuracy_config.STUB_BENCHMARKS``
- drop ``is_implemented: false`` from the ``aime25`` plugins.yaml entry
- switch ``default_grader`` to ``lighteval_expr``
- add the ``aime25`` row to ``docs/accuracy/accuracy-benchmarking.md``
- move it from "Still Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing the Status Summary, Method Count Summary, and Suggested Implementation Order accordingly)

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
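The row-to-problem mapping described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in ``aime25.py``: the ``BenchmarkProblem`` dataclass here is a hypothetical stand-in, the column names ``problem`` and ``answer`` are assumptions about the dataset schema, and ``rows_to_problems`` is an invented helper name. In-memory rows are used instead of downloading the dataset.

```python
from dataclasses import dataclass, field
from typing import Any

GENERATION_SIZE = 32768  # value stated in the commit message


@dataclass
class BenchmarkProblem:
    """Hypothetical stand-in; the real aiperf class may differ."""

    prompt: str
    ground_truth: str
    metadata: dict[str, Any] = field(default_factory=dict)


def rows_to_problems(rows, tasks=None, n_shots=0, enable_cot=False):
    """Map dataset rows to problems, one problem per row.

    tasks / n_shots / enable_cot are accepted for protocol uniformity
    and deliberately ignored, as the commit message describes.
    """
    return [
        BenchmarkProblem(
            prompt=row["problem"],  # bare problem text (assumed column name)
            ground_truth=str(row["answer"]),  # answers are stringified
            metadata={"generation_size": GENERATION_SIZE},
        )
        for row in rows
    ]


# Made-up rows standing in for the real dataset.
sample_rows = [
    {"problem": "Find x such that x + 1 = 3.", "answer": 2},
    {"problem": "Compute 7 * 6.", "answer": 42},
]
problems = rows_to_problems(sample_rows, n_shots=5, enable_cot=True)
```

Passing ``n_shots=5`` or ``enable_cot=True`` changes nothing in the output, which is the "accepted but ignored" behavior the message calls out.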
Walkthrough
Implements the AIME25 benchmark loader (lighteval-aligned) to load
Changes
AIME25 Benchmark Implementation
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/aiperf/accuracy/benchmarks/aime25.py (1)
49-49: ⚡ Quick win
Remove type annotation from **kwargs.
The **kwargs: Any annotation should be removed. Based on learnings, variadic keyword arguments should remain untyped unless explicit named parameters are needed.
♻️ Proposed fix
-    def __init__(self, run: BenchmarkRun, **kwargs: Any) -> None:
+    def __init__(self, run: BenchmarkRun, **kwargs) -> None:
Based on learnings: "In Python projects (e.g., in aiperf), avoid adding type annotations to **kwargs like **kwargs: Any. The variadic keyword arguments are inherently dynamic; leave **kwargs untyped or replace with explicit, named keyword parameters if a concrete contract is needed."
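As a quick check that the proposed signature change is behavior-preserving, the sketch below compares an annotated and an untyped ``**kwargs`` constructor. The two classes are hypothetical stand-ins (``run`` is typed as ``str`` here rather than the real ``BenchmarkRun``), and the example only shows how the two forms differ under introspection.

```python
import inspect
from typing import Any


class AnnotatedInit:
    """Stand-in for the current signature with **kwargs: Any."""

    def __init__(self, run: str, **kwargs: Any) -> None:
        self.run = run
        self.extra = kwargs


class UntypedInit:
    """Stand-in for the proposed signature with bare **kwargs."""

    def __init__(self, run: str, **kwargs) -> None:
        self.run = run
        self.extra = kwargs


def kwargs_annotation(cls):
    """Return the annotation recorded for the **kwargs parameter."""
    return inspect.signature(cls.__init__).parameters["kwargs"].annotation


# Both constructors accept the same calls and store the same extras.
a = AnnotatedInit("run-1", n_shots=0, enable_cot=False)
b = UntypedInit("run-1", n_shots=0, enable_cot=False)
```

The only observable difference is the recorded annotation: ``Any`` in the first class and ``inspect.Parameter.empty`` in the second. Runtime behavior is identical, which is consistent with the reviewer's note that the change preserves behavior.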
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/accuracy/accuracy_stubs.md`:
- Line 328: The docs currently give conflicting guidance for math_500 (which
mirrors AIME24Benchmark) by recommending pairing with lighteval_latex while
earlier listing math_500's default grader as math; update the accuracy_stubs
entry for math_500 to explicitly state that pairing with lighteval_latex is a
planned post-implementation transition (or else change the recommended grader to
match current default 'math') so contributors have a single source of truth;
reference the symbol math_500 and the grader names lighteval_latex and math when
making this clarification.
---
Nitpick comments:
In `@src/aiperf/accuracy/benchmarks/aime25.py`:
- Line 49: The __init__ signature for the class in aime25.py annotates variadic
keywords as **kwargs: Any; remove the type annotation and change the signature
to use untyped **kwargs so it reads def __init__(self, run: BenchmarkRun,
**kwargs) -> None:, updating any references to __init__ if they assert the typed
form and ensuring no additional named keyword parameters are required; this
removes the unnecessary **kwargs: Any annotation while preserving behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0676f619-de8e-4ffa-acb4-481290ad2309
📒 Files selected for processing (6)
- docs/accuracy/accuracy-benchmarking.md
- docs/accuracy/accuracy_stubs.md
- src/aiperf/accuracy/benchmarks/aime25.py
- src/aiperf/plugin/plugins.yaml
- tests/unit/accuracy/test_accuracy_config.py
- tests/unit/accuracy/test_aime25_benchmark.py
💤 Files with no reviewable changes (1)
- tests/unit/accuracy/test_accuracy_config.py
aime25 benchmark in plugins.yaml (default_grader: math, default_n_shots: 0); scaffold loader raises NotImplementedError until the full lighteval-backed implementation lands.
LightevalExprGrader, expr_gold_metric) introduced in AIP-874.
Reference: trt-llm-benchmark-recipe/src/accuracy/acc_bench_lighteval.py (aime25 task)
Summary by CodeRabbit
New Features
Documentation
Tests
Chores